Scikit-learn (sklearn) ships with several small built-in datasets for practicing data manipulation, regression, and classification. Each one is loaded with its own load_* function from sklearn.datasets, as shown below.

Classification datasets:

  • iris (4 features - set of measurements of flowers - 3 possible flower species)
  • breast_cancer (features describing malignant and benign cell nuclei)
  • digits (hand-written digits, each stored as an array of 64 numbers representing an 8x8 grayscale image)
  • wine (13 numeric features - 3 possible wine classes)

Regression datasets:

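  • boston (13 numeric/categorical features - used to predict housing prices in Boston; load_boston was removed in scikit-learn 1.2, so the Boston examples here require an older version)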
  • diabetes (10 numeric features - used to predict disease progression)

Multivariate regression:

  • linnerud (3 numeric features describing physical exercises - 3 numeric targets: weight, waist, pulse)

Loading dataset (replace name with one of the datasets listed above, e.g. load_iris):

from sklearn.datasets import load_name

name = load_name()


In [1]:
from sklearn.datasets import load_iris

iris = load_iris()

Accessing dataset:

To see the available options, type iris. and press Tab after loading the dataset.

Select any of the following:

iris.data
iris.DESCR
iris.feature_names
iris.target
iris.target_names
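Outside an interactive session, tab completion is not available. The following minimal sketch lists the same fields programmatically, relying on the fact that the object returned by load_iris behaves like a dictionary. The variable names available and same_array are only illustrative, and the exact set of fields depends on the installed scikit-learn version.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# The returned object behaves like a dictionary, so keys() lists its fields.
available = sorted(iris.keys())
print(available)

# Each field can be read as an attribute or as a dictionary item.
same_array = iris.data is iris["data"]
print(same_array)
```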

Examining dataset


In [2]:
import pandas as pd

In [3]:
iris_features_df = pd.DataFrame(data=iris.data,
                               columns=iris.feature_names)

iris_features_df.head(2)


Out[3]:
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)
0 5.1 3.5 1.4 0.2
1 4.9 3.0 1.4 0.2

In [4]:
iris_target_df = pd.DataFrame(data=iris.target,
                               columns=["Species"])

iris_target_df.head(2)


Out[4]:
Species
0 0
1 0

In [5]:
list(iris.target_names)  # 0 - setosa, 1 - versicolor, 2 - virginica


Out[5]:
['setosa', 'versicolor', 'virginica']
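Since the integer codes in iris.target are positions in iris.target_names, the codes can be turned into readable labels by indexing one array with the other. This is a small sketch of that idea; the names species and iris_df are illustrative and not part of the original notebook.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()

# Index the array of class names with the integer class codes.
species = iris.target_names[iris.target]

# Combine the features and the readable labels in one DataFrame.
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
iris_df["Species"] = species

print(iris_df["Species"].value_counts())
```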

Printing description of dataset


In [6]:
print(iris.DESCR)


Iris Plants Database
====================

Notes
-----
Data Set Characteristics:
    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
    :Summary Statistics:

    ============== ==== ==== ======= ===== ====================
                    Min  Max   Mean    SD   Class Correlation
    ============== ==== ==== ======= ===== ====================
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20  0.76     0.9565  (high!)
    ============== ==== ==== ======= ===== ====================

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :Date: July, 1988

This is a copy of UCI ML iris datasets.
http://archive.ics.uci.edu/ml/datasets/Iris

The famous Iris database, first used by Sir R.A Fisher

This is perhaps the best known database to be found in the
pattern recognition literature.  Fisher's paper is a classic in the field and
is referenced frequently to this day.  (See Duda & Hart, for example.)  The
data set contains 3 classes of 50 instances each, where each class refers to a
type of iris plant.  One class is linearly separable from the other 2; the
latter are NOT linearly separable from each other.

References
----------
   - Fisher,R.A. "The use of multiple measurements in taxonomic problems"
     Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to
     Mathematical Statistics" (John Wiley, NY, 1950).
   - Duda,R.O., & Hart,P.E. (1973) Pattern Classification and Scene Analysis.
     (Q327.D83) John Wiley & Sons.  ISBN 0-471-22361-1.  See page 218.
   - Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System
     Structure and Classification Rule for Recognition in Partially Exposed
     Environments".  IEEE Transactions on Pattern Analysis and Machine
     Intelligence, Vol. PAMI-2, No. 1, 67-71.
   - Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule".  IEEE Transactions
     on Information Theory, May 1972, 431-433.
   - See also: 1988 MLC Proceedings, 54-64.  Cheeseman et al"s AUTOCLASS II
     conceptual clustering system finds 3 classes in the data.
   - Many, many more ...

Feature Names

Note: digits is left out of this loop because it has no feature_names in the scikit-learn version used here.


In [7]:
from sklearn.datasets import (load_breast_cancer, load_boston, load_diabetes,
                              load_linnerud, load_digits)

In [8]:
datasets = {'Iris': load_iris(), 'Breast Cancer': load_breast_cancer(), 'Boston': load_boston(),
            'Diabetes': load_diabetes(), 'Linnerud': load_linnerud()}

# Print the feature names of each dataset
for name, dataset in datasets.items():
    print("\n** {} **".format(name))
    print(dataset.feature_names)


** Iris **
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']

** Breast Cancer **
['mean radius' 'mean texture' 'mean perimeter' 'mean area'
 'mean smoothness' 'mean compactness' 'mean concavity'
 'mean concave points' 'mean symmetry' 'mean fractal dimension'
 'radius error' 'texture error' 'perimeter error' 'area error'
 'smoothness error' 'compactness error' 'concavity error'
 'concave points error' 'symmetry error' 'fractal dimension error'
 'worst radius' 'worst texture' 'worst perimeter' 'worst area'
 'worst smoothness' 'worst compactness' 'worst concavity'
 'worst concave points' 'worst symmetry' 'worst fractal dimension']

** Boston **
['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO'
 'B' 'LSTAT']

** Diabetes **
['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']

** Linnerud **
['Chins', 'Situps', 'Jumps']
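Because not every dataset is guaranteed to expose every field, it can help to check for a field before reading it. The sketch below assumes only that the loaded object supports dictionary-style membership tests; describe_features is a hypothetical helper written for this illustration, not a scikit-learn function.

```python
from sklearn.datasets import load_digits, load_iris


def describe_features(bunch):
    """Return the feature names as a list, or None if the field is absent."""
    # Membership tests work on field names because the object is dict-like.
    if "feature_names" in bunch:
        return list(bunch["feature_names"])
    return None


print(describe_features(load_iris()))
# Whether digits has feature names depends on the installed version.
print(describe_features(load_digits()))
```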

Target Names


In [9]:
datasets = {'Iris': load_iris(), 'Breast Cancer': load_breast_cancer(),
            'Digits': load_digits(), 'Linnerud': load_linnerud()}

# Print the target names of each dataset
for name, dataset in datasets.items():
    print("\n** {} **".format(name))
    print(dataset.target_names)


** Iris **
['setosa' 'versicolor' 'virginica']

** Breast Cancer **
['malignant' 'benign']

** Digits **
[0 1 2 3 4 5 6 7 8 9]

** Linnerud **
['Weight', 'Waist', 'Pulse']

Shapes


In [10]:
datasets = {'Iris': load_iris(), 'Breast Cancer': load_breast_cancer(), 'Boston': load_boston(),
            'Digits': load_digits(), 'Diabetes': load_diabetes(), 'Linnerud': load_linnerud()}

# Print the (samples, features) shape of each dataset
for name, dataset in datasets.items():
    print("\n** {} **".format(name))
    print(dataset.data.shape)


** Iris **
(150, 4)

** Breast Cancer **
(569, 30)

** Boston **
(506, 13)

** Digits **
(1797, 64)

** Diabetes **
(442, 10)

** Linnerud **
(20, 3)
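When only the feature matrix and the target array are needed, the loaders also accept return_X_y=True and return the two arrays directly instead of the full dataset object. The sketch below uses the wine dataset, which is listed above but not loaded elsewhere in this notebook; the names X and y are just the usual convention.

```python
from sklearn.datasets import load_wine

# return_X_y=True skips the dataset object and returns (data, target).
X, y = load_wine(return_X_y=True)

# Rows are samples; columns of X are the 13 numeric features.
print(X.shape, y.shape)
```

The first element of X.shape always matches the length of y, which is a quick sanity check before passing the arrays to a model.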